A comprehensive risk factor analysis using association rules in people with diabetic kidney disease

Association rule is a transparent machine learning method expected to share information about risks for chronic kidney disease (CKD) among diabetic patients, but its findings in clinical data are limited. We used the association rule to evaluate the risk for kidney disease in General and Worker diabetic cohorts. The absence of risk factors was examined for association with stable kidney function and worsening kidney function. A confidence value was used as an index of association, and a lift of > 1 was considered significant. Analyses were applied for individuals stratified by KDIGO’s (Kidney Disease: Improving Global Outcomes) CKD risk categories. A General cohort of 4935 with a mean age of 66.7 years and a Worker cohort of 2153 with a mean age of 47.8 years were included in the analysis. Good glycemic control was significantly related to stable kidney function in low-risk categories among the General cohort, and in very-high risk categories among the Worker cohort; confidences were 0.82 and 0.77, respectively. Similar results were found with poor glycemic control and worsening kidney function; confidences of HbA1c were 0.41 and 0.27, respectively. Similarly, anemia, obesity, and hypertension showed significant relationships in the low-risk General and very-high risk Worker cohorts. Stratified risk assessment using association rules revealed the importance of the presence or absence of risk factors.


Methods
Study participants. In this study, two populations with diabetes were analyzed separately. One was a general diabetic population cohort (General cohort), and the other was a cohort of workers with diabetes, comprising people insured by the Toshiba Health Insurance Society (Worker cohort). In both cohorts, people were classified as having diabetes mellitus if they met any of the following criteria: glycated hemoglobin (HbA1c) ≥ 6.5%, fasting plasma glucose ≥ 7.0 mmol/L (≥ 126 mg/dL), or treatment for diabetes mellitus 6 . The General cohort consisted of adults who underwent annual health examinations in Kanazawa, Ishikawa, Japan, between 1999 and 2018; adults aged ≥ 40 years who were not covered by company insurance were eligible to undergo the health examinations. The Worker cohort consisted of employees of a Japanese company (Toshiba Corporation), and information from annual health examinations from 2010 to 2016 was used. Participants in both cohorts were eligible for analysis if they had serum creatinine levels measured and had at least one follow-up visit during the observation period. There was no upper age limit. Those who refused to participate, did not have baseline information on risk factors or eGFR, or were not followed up for 5 years were excluded from the analysis. Baseline was defined as the earliest year in a participant's data series in which eGFR was measured and which was followed by at least 5 years of follow-up.
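The classification criteria above can be written as a small predicate. This is only a minimal sketch: the function name, argument names, and the handling of missing measurements are assumptions made for illustration and are not taken from the study's analysis code.

```python
from typing import Optional

# Thresholds as stated in the text.
HBA1C_THRESHOLD_PERCENT = 6.5         # HbA1c >= 6.5%
FASTING_GLUCOSE_THRESHOLD_MMOL = 7.0  # fasting plasma glucose >= 7.0 mmol/L


def has_diabetes(hba1c_percent: Optional[float],
                 fasting_glucose_mmol_l: Optional[float],
                 on_diabetes_treatment: bool) -> bool:
    """Return True if any one of the three criteria is met.

    A missing measurement (None) is treated as "criterion not met";
    that handling is an assumption of this sketch.
    """
    if on_diabetes_treatment:
        return True
    if hba1c_percent is not None and hba1c_percent >= HBA1C_THRESHOLD_PERCENT:
        return True
    if (fasting_glucose_mmol_l is not None
            and fasting_glucose_mmol_l >= FASTING_GLUCOSE_THRESHOLD_MMOL):
        return True
    return False
```

For example, `has_diabetes(6.4, 7.2, False)` is true because the fasting glucose criterion alone is sufficient.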
Measurement of risk factors. Risk factors recorded in the health examination were used in the analysis.
The following risk factors were assessed: eGFR, urinary protein, glycated hemoglobin (HbA1c), hemoglobin, aspartate aminotransferase (AST), alanine aminotransferase (ALT), γ-glutamyltransferase (GGT), total cholesterol, triglyceride, high-density lipoprotein (HDL) cholesterol, low-density lipoprotein (LDL) cholesterol, systolic blood pressure, diastolic blood pressure, body mass index (BMI), diabetic retinopathy, and 2-year decrease in eGFR from baseline. eGFR was calculated from the serum creatinine level, measured by the enzymatic method, using the formula for the Japanese population 7 . Spot urine samples were collected at random times during the day, tested by dipstick, and classified as negative/trace or ≥ 1+ (1+ corresponds to a urine protein of about 30 mg/dL). Blood pressure was measured in the sitting position after resting.
Outcomes. The outcome of worsening kidney function was defined as a ≥ 30% decrease in eGFR from baseline during the follow-up period, used as a surrogate marker for end-stage kidney disease 8 . In the Worker cohort, the following events were also included in the definition: diagnosis of end-stage kidney disease, initiation of maintenance hemodialysis, and kidney transplantation. The outcome of stable kidney function was defined as not reaching the outcome of worsening kidney function, that is, a < 30% decrease in eGFR from baseline during the follow-up period.
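The outcome definition above can be illustrated with a short sketch. The names are hypothetical, and the code assumes that any single follow-up measurement at or beyond the 30% decrease is enough to count as worsening; the text only says "during the follow-up period", so that reading is an assumption.

```python
from typing import Sequence


def fractional_decrease(baseline_egfr: float, followup_egfr: float) -> float:
    """Fraction by which eGFR fell from baseline (negative if eGFR rose)."""
    return (baseline_egfr - followup_egfr) / baseline_egfr


def kidney_outcome(baseline_egfr: float,
                   followup_egfrs: Sequence[float],
                   had_kidney_failure_event: bool = False,
                   threshold: float = 0.30) -> str:
    """Classify a participant as "worsening" or "stable".

    had_kidney_failure_event stands for the additional Worker cohort events
    (end-stage kidney disease, maintenance hemodialysis, transplantation).
    """
    if had_kidney_failure_event:
        return "worsening"
    for egfr in followup_egfrs:
        # Assumption: one qualifying follow-up value is sufficient.
        if fractional_decrease(baseline_egfr, egfr) >= threshold:
            return "worsening"
    return "stable"
```

With a baseline eGFR of 80, a follow-up value of 55 (a 31.25% decrease) is classified as worsening, whereas 60 (a 25% decrease) is classified as stable.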
Statistical analysis. In this study, we used association rules 5,9 to examine the relationship between risk factors and outcomes. For a comprehensive analysis, we examined the absence of risk factors and stable kidney function, as well as the presence of risk factors and worsening kidney function.
The association rule is an indicator originally used in market analysis, but it is also being applied in medical research 10 . We used the value of confidence as an indicator of the strength of association between a risk factor and an outcome. Confidence stable is the proportion of people with stable kidney function among those without a given risk factor; confidence worsening is, similarly, the proportion of people with worsening kidney function among those with the risk factor. Lift indicates the significance of the relationship between X and Y, where X is the presence or absence of a risk factor and Y is the outcome. In general, a lift value greater than 1.0 indicates that X and Y are dependent on each other. The lift was used to confirm that the association indicated by the confidence was not due to chance, considering the frequency of the outcome in the overall data. The formulas for confidence and lift follow the description of the risk factor thresholds. We considered confidence to be significant when lift > 1.0 11 . For visual description, see Appendix Fig. 1.
Thresholds of the value of risk factors were set at the upper or lower 20% of the population by sex, and the presence or absence of each risk factor was classified according to the threshold. For example, individuals with systolic blood pressure values above the 80th percentile or with urinary protein ≥ 1+ were classified as having a risk factor. The criteria for determining the presence or absence of each risk factor and the values of the threshold are shown in Appendix Table 1.
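The sex-specific thresholding described above might be sketched as follows. The record layout, function names, and the percentile interpolation method (`method="inclusive"` in Python's `statistics.quantiles`) are assumptions for illustration; the study's actual cut-off values are those listed in Appendix Table 1, and categorical factors such as urinary protein ≥ 1+ are not covered by this sketch.

```python
from statistics import quantiles


def sex_specific_cutoffs(records, factor, tail="upper", fraction=0.20):
    """Return {sex: cutoff} marking the upper or lower `fraction` of values.

    records is a list of dicts such as {"sex": "M", "sbp": 150}.
    Each sex group needs at least two values for a percentile to exist.
    """
    # Percentile index: 80 for the upper 20%, 20 for the lower 20%.
    pct = round(100 * (1 - fraction)) if tail == "upper" else round(100 * fraction)
    cutoffs = {}
    for sex in {r["sex"] for r in records}:
        values = [r[factor] for r in records if r["sex"] == sex]
        cuts = quantiles(values, n=100, method="inclusive")  # 99 cut points
        cutoffs[sex] = cuts[pct - 1]
    return cutoffs


def has_risk_factor(record, factor, cutoffs, tail="upper"):
    """Flag a record whose value lies beyond the cutoff for its own sex."""
    cutoff = cutoffs[record["sex"]]
    if tail == "upper":
        return record[factor] > cutoff
    return record[factor] < cutoff
```

Because the cutoff is computed within each sex, the same measured value can count as a risk factor in one sex and not in the other. Passing `fraction=0.10` gives the stricter 10% threshold.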
$$\mathrm{Confidence}_{\mathrm{stable}} = \frac{n_{\text{absence of a risk factor and stable kidney function}}}{n_{\text{absence of a risk factor}}}$$

$$\mathrm{Confidence}_{\mathrm{worsening}} = \frac{n_{\text{presence of a risk factor and worsening kidney function}}}{n_{\text{presence of a risk factor}}}$$

$$\mathrm{Lift}_{\mathrm{stable}} = \frac{n_{\text{absence of a risk factor and stable kidney function}} \,/\, n_{\text{absence of a risk factor}}}{n_{\text{stable kidney function}} \,/\, n_{\text{all}}}$$

$$\mathrm{Lift}_{\mathrm{worsening}} = \frac{n_{\text{presence of a risk factor and worsening kidney function}} \,/\, n_{\text{presence of a risk factor}}}{n_{\text{worsening kidney function}} \,/\, n_{\text{all}}}$$

www.nature.com/scientificreports/

Association rules are usually mined over sets of multiple risk factors associated with an outcome. However, when many rules are discovered, they are often clinically difficult to interpret 12 . To address this problem, we conducted a stratified analysis according to clinically important status. The background characteristics used for stratification were (1) CKD risk categories, (2) diabetic retinopathy, and (3) eGFR decline rate over 2 years. CKD risk categories are based on KDIGO's (Kidney Disease: Improving Global Outcomes) risk categories 13 , with albuminuria replaced by dipstick proteinuria (Appendix Fig. 2). Analysis was performed for all participants, as well as for participants stratified by (1); by (1) and (2); and by (1), (2), and (3).
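As a worked illustration of the confidence and lift definitions, the following sketch computes both from counts. The function names and all counts are hypothetical and chosen only to make the arithmetic easy to follow; they are not the study's data.

```python
def confidence(n_factor_and_outcome: int, n_factor: int) -> float:
    """Proportion with the outcome among those in the risk-factor group."""
    return n_factor_and_outcome / n_factor


def lift(n_factor_and_outcome: int, n_factor: int,
         n_outcome: int, n_all: int) -> float:
    """Confidence divided by the overall proportion with the outcome."""
    return confidence(n_factor_and_outcome, n_factor) / (n_outcome / n_all)


# Hypothetical counts: 1000 people, 50 with worsening kidney function.
# 200 people have the risk factor, and 20 of those 200 worsen.
conf_worsening = confidence(20, 200)       # 20 / 200
lift_worsening = lift(20, 200, 50, 1000)   # (20 / 200) / (50 / 1000)

# The remaining 800 people lack the risk factor; 770 of them are stable,
# and 950 of the 1000 people are stable overall.
conf_stable = confidence(770, 800)         # 770 / 800
lift_stable = lift(770, 800, 950, 1000)    # (770 / 800) / (950 / 1000)

# Under the criterion in the text, a confidence counts as significant
# only when the corresponding lift exceeds 1.0.
significant_worsening = lift_worsening > 1.0
significant_stable = lift_stable > 1.0
```

In this made-up example the worsening rule has a confidence of 0.10 and a lift of about 2, while the stable rule has a confidence of about 0.96 and a lift only slightly above 1. When the outcome is common in the whole population, as stable kidney function is, even a high confidence yields a lift close to 1.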
All analyses were performed using Stata/MP statistical software (version 17; StataCorp LP, College Station, TX, USA). The graphs were plotted using the ggplot2 package (version 3.4.2) in R (version 4.2.2).

Results
Selection of participants and baseline characteristics. For the General cohort, 4935 people with diabetes met the eligibility criteria (Appendix Fig. 3A). For the Worker cohort, 2153 people with diabetes met the inclusion criteria (Appendix Fig. 3B). Table 1 shows the participants' baseline characteristics. The General cohort had a mean age of 66.7 years, and about half of the participants were women. The mean eGFR was 75.0 mL/min/1.73 m², and 11.6% of people had proteinuria ≥ 1+. The Worker cohort had a mean age of 47.8 years, and the majority were men. The mean eGFR was 80.1 mL/min/1.73 m², and many participants had preserved kidney function.
During the 5-year follow-up period, 237 (4.8%) people in the General cohort experienced worsening kidney function, and the remaining 95.2% had stable kidney function. In the Worker cohort, 110 (5.1%) people experienced worsening kidney function, and 94.9% had stable kidney function.
Association rules for the overall population. Figure 1A shows the results of the analysis using association rules for the overall General cohort. For stable kidney function, the confidence stable of most risk factors was ≥ 0.7 with significance (lift stable > 1). For worsening kidney function, HbA1c and fasting plasma glucose had higher confidence worsening than the other risk factors; both had confidence worsening of > 0.4. The confidence worsening of urine protein was 0.28, and it had the highest lift worsening among all risk factors (2.45). Figure 1B shows the results of the association rules for the overall Worker cohort. For stable kidney function, similar to the General cohort, the confidence stable of most risk factors was > 0.9 with significance. For worsening kidney function, most of the confidence worsening values were lower than those of the General cohort. HbA1c and fasting plasma glucose had the highest confidence worsening among the risk factors, with lift worsening > 2.0.
Stratified analysis using association rules by CKD risk categories. Stratified analyses using association rules were conducted by CKD risk categories. Results for the General cohort are shown in Fig. 2A. For stable kidney function, in the low and moderate risk categories, high confidence stable was observed for most risk factors. HbA1c and fasting plasma glucose had significant confidence stable in the low-risk category (0.82 and 0.83, [...]

Table 1. Baseline characteristics of the study population. Columns: General diabetic population cohort (General cohort) and Worker with diabetes cohort (Worker cohort). Continuous variables are expressed as mean ± standard deviation, or median (25th and 75th percentiles). Categorical variables are expressed as numbers (percentages). AST aspartate aminotransferase, ALT alanine aminotransferase, eGFR estimated glomerular filtration rate, GGT γ-glutamyl transferase, HbA1c glycated hemoglobin, HDL high-density lipoprotein, LDL low-density lipoprotein. *n = 2894.

[...] Fig. 2B. For stable kidney function, in contrast to the General cohort, significant confidence stable values were mainly observed in the high and very-high risk categories. In the very-high risk category, HbA1c, hemoglobin, total cholesterol, triglyceride, and BMI had significant confidence stable (0.77, 0.85, 0.85, 0.84, and 0.84, respectively). For worsening kidney function, similar to stable kidney function, risk factors with significant confidence worsening were observed in the very-high risk category (confidence worsening values of hemoglobin, total cholesterol, triglyceride, HDL cholesterol, and BMI were 0.31, 0.50, 0.36, 0.31, and 0.43, respectively).
Stratified analysis using association rules by the combination of CKD risk categories and diabetic retinopathy. Stratified analyses using association rules were conducted by a combination of CKD risk categories and diabetic retinopathy. Results for the General cohort are shown in Fig. 3A. For stable kidney function, confidence stable of most risk factors was higher in the low/moderate risk categories than in the high/very-high risk categories, regardless of the presence of retinopathy. Among the low/moderate risk categories, HbA1c, fasting plasma glucose, and hemoglobin showed higher confidence stable in the subgroup without retinopathy than in those with retinopathy. For worsening kidney function, like stable kidney function, almost all significant risk factors showed higher confidence worsening in the low/moderate risk categories than in the high/very-high risk categories, regardless of the presence of retinopathy. Biomarkers of liver injury such as AST, ALT, and GGT showed significant confidence worsening in the high/very-high risk categories with retinopathy. In that category, triglycerides had the highest lift worsening (2.75).

Figure 1. Analysis using association rules between kidney outcomes and without/with risk factors. Analysis using association rules for the General cohort (A; n = 4,935) and the Worker cohort (B; n = 2,153). The size of the circles indicates confidence, and the strength of the color indicates lift. Blue circles show the association between the absence of risk and stable kidney function, and red circles show the association between the presence of risk and worsening kidney function. When the lift is ≤ 1, the circles are grayed out. *n = 2894. AST aspartate aminotransferase, ALT alanine aminotransferase, eGFR estimated glomerular filtration rate, GGT γ-glutamyl transferase, HbA1c glycated hemoglobin, HDL high-density lipoprotein, LDL low-density lipoprotein.
The same analyses for the Worker cohort are shown in Fig. 3B. For stable kidney function, in patients without diabetic retinopathy, significant risk factors were found in the high/very-high risk categories. In patients with diabetic retinopathy, the values of confidence stable were similar between the high/very-high and the low/moderate risk categories, but higher lift stable values were observed in the high/very-high risk categories. For worsening kidney function, regardless of the presence of retinopathy, risk factors with significant, higher confidence worsening were found in the high/very-high risk categories. In addition, patients in the high/very-high risk categories with retinopathy showed higher confidence worsening than those without retinopathy for risk factors of glycemic control, as well as lipid and blood pressure control. In that subgroup, unlike the General cohort, AST, ALT, and GGT were not significant. In the subgroup of low/moderate risk categories with diabetic retinopathy, the confidence worsening values of HbA1c, total cholesterol, and blood pressure were higher than those of other risk factors, each with high significance (lift worsening > 3.0).

Figure 2. Analysis using association rules between kidney outcomes and without/with risk factors stratified by the combination of CKD risk categories. Analysis using association rules for the General cohort (A; n = 4,935) and the Worker cohort (B; n = 2,153). The size of the circles indicates confidence, and the strength of the color indicates lift. Blue circles show the association between the absence of risk and stable kidney function, and red circles show the association between the presence of risk and worsening kidney function. When the lift is ≤ 1, the circles are grayed out. *n = 2894. AST aspartate aminotransferase, ALT alanine aminotransferase, eGFR estimated glomerular filtration rate, GGT γ-glutamyl transferase, HbA1c glycated hemoglobin, HDL high-density lipoprotein, LDL low-density lipoprotein.

Additional stratification by 2-year eGFR change was conducted for subgroups with data on eGFR change over the first two years of observation. The results for the General cohort are shown in Appendix Fig. 4A. For stable kidney function, more risk factors had significant confidence stable in the subgroup with eGFR change ≥ − 30% and < 0% than in the subgroup with eGFR change ≥ 0%. No significant association was found for the subgroup with eGFR change < − 30% because no participants reached the outcome of stable kidney function. For worsening kidney function, even with stratification by eGFR change, significant confidence worsening of risk factors, including HbA1c, was mainly found in the low/moderate risk categories. For the Worker cohort, as in the General cohort, the significant confidences were mainly found in the stratum with eGFR change ≥ − 30% and < 0% for both stable and worsening kidney function (Appendix Fig. 4B).
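The three eGFR-change strata named above can be expressed as a small classifier. The function names and label strings are hypothetical, and the sketch assumes that eGFR change is the percent change from the baseline value to the value measured two years later.

```python
def egfr_change_percent(baseline_egfr: float, two_year_egfr: float) -> float:
    """Percent change in eGFR from baseline to the 2-year measurement."""
    return (two_year_egfr - baseline_egfr) / baseline_egfr * 100.0


def egfr_change_stratum(baseline_egfr: float, two_year_egfr: float) -> str:
    """Assign one of the three strata used in the stratified analysis."""
    change = egfr_change_percent(baseline_egfr, two_year_egfr)
    if change < -30.0:
        return "< -30%"
    if change < 0.0:
        return ">= -30% and < 0%"
    return ">= 0%"
```

For a baseline eGFR of 80, a 2-year value of 72 (a 10% decrease) falls in the middle stratum, and a value of 50 (a 37.5% decrease) falls in the "< -30%" stratum.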
To confirm the robustness of the results, we also set the risk factor thresholds at the upper or lower 10% of the population by sex, as was done for the 20% threshold, and classified the presence or absence of each risk factor accordingly. Similar results were obtained for both the presence and absence of risk factors (Appendix Figs. 5, 6, 7, 8).
Figure 3. Analysis using association rules between kidney outcomes and without/with risk factors stratified by the combination of CKD risk categories and diabetic retinopathy. Analysis using association rules for the General cohort (A; n = 4,935) and the Worker cohort (B; n = 2,153). The size of the circles indicates confidence, and the strength of the color indicates lift. Blue circles show the association between the absence of risk and stable kidney function, and red circles show the association between the presence of risk and worsening kidney function. When the lift is ≤ 1, the circles are grayed out. *n = 2894. AST aspartate aminotransferase, ALT alanine aminotransferase, eGFR estimated glomerular filtration rate, GGT γ-glutamyl transferase, HbA1c glycated hemoglobin, HDL high-density lipoprotein, LDL low-density lipoprotein.

Clinical significance and study implications.
Our study confirmed the associations between risk factors and loss of kidney function, which have been reported in previous studies. For example, urinary protein 1 , poor glycemic control 14 , and hypertension 15 have been reported as risk factors for kidney dysfunction.
In the general population with diabetes, our results showed that relationships with risk factors were found in the low-risk group. Higher age was the major characteristic of this cohort. Less strict treatment goals, such as those for blood glucose and blood pressure, are standard care for elderly diabetic patients with kidney disease 16 , but a study reported similar effects of glycemic control on CKD in middle-aged and elderly diabetics 17 . The detailed analysis by risk category in our study indicated the importance of risk factors among the low-risk group, even in older people.
In workers with diabetes, risk factors were associated with renal events in the high-risk group, while these associations were weaker in the low-risk group. This is most likely because the low-risk group had glomerular hyperfiltration, which is seen in the early stages of diabetic kidney disease 18 and may have masked kidney damage. That might explain the relationships of blood glucose and blood pressure for worsening kidney function observed among the low-risk group with diabetic retinopathy. The results of this study do not deny the importance of early risk management among working-age people.
The absence of risk factors is also important for risk management. For example, good glycemic and blood pressure control were associated with stable kidney function, which is consistent with previous cohort studies 19,20 . Furthermore, similar results were found in the analyses stratified by CKD risk categories. The results show that the absence of these risk factors is associated with stable kidney function, which is in line with the guideline treatment strategy 21 .
The relationship between dyslipidemia and kidney dysfunction was shown in cases of high-risk categories and complicated retinopathy. Studies examining the relationship between dyslipidemia and kidney dysfunction have been limited 22 , but a study of advanced diabetic nephropathy in type 1 diabetes has shown results similar to ours 23 . In our study, LDL cholesterol was not associated with prognosis. This is consistent with previous observational studies in non-dialysis patients, which also reported no significant association between LDL cholesterol and renal prognosis 24 . Triglycerides, which were found to be associated with kidney events in our study, have also been associated with CKD progression and cardiovascular events 25 . LDL apheresis, a new treatment for advanced DKD patients, improves life prognosis but does not change eGFR loss 26 . Our study did not assess causality, and treatment of dyslipidemia in DKD requires further investigation.
The results of the General cohort showed that liver injury (high value of AST, ALT, or GGT) was related to worsening kidney function. Studies have reported that non-alcoholic fatty liver disease (NAFLD) is related to an increased risk of CKD 27 , both in diabetic and non-diabetic individuals 28 . The association in the group with retinopathy has been reported in a previous study 29 . This suggests that liver injury may be an important risk factor in advanced diabetic kidney disease.
Our study demonstrated the relevance of risk factors by stratifying the subjects. Although the detailed stratified analysis method is straightforward, increasing the number of strata frequently results in the "curse of dimensionality," a situation in which the number of risk factor combinations increases dramatically 30 . This is often problematic for machine learning methods. Expert or data-driven methods for identifying important features have been proposed 30 , but the interpretability of the results must still be considered.

Strengths and limitations.
One of the strengths of this study was that the analysis used association rules stratified by known risk status, which made the results easy to interpret. Stratification by presence or absence of risk factors may also be helpful in sharing the risk assessment with patients. Furthermore, the General and Worker cohorts were complementary as a general population, and comparing the similarities and differences between them allowed for a better interpretation of risk factors. Conversely, this study has several limitations. First, analysis using the association rule does not allow for multivariable analysis; therefore, adjustment for covariates could not be performed. Second, this study used a combination of confidence and lift values to provide transparency. However, when using logistic regression, confidence values may be reported along with confidence intervals, which may be comprehensible in some situations. Third, there is no widely accepted, clinically meaningful threshold of confidence value. Fourth, the threshold for presence or absence of each risk factor was set uniformly, which may not have been a clinically meaningful classification. Application of a stricter threshold may be appropriate for some risk factors. Fifth, dipstick measurement of proteinuria is known to have high false-positive rates 31 . Therefore, there is a possibility that people in this study who were classified as "presence of risk factors" based on the dipstick test may not actually have risk factors. Sixth, because the number of events of worsening kidney function in our cohorts was small in some strata, the confidence and lift could be altered by small changes in the number of events. Seventh, information on treatment was limited and could not be included in the analysis.
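The sixth limitation can be made concrete with a small numerical sketch. The stratum sizes and event counts below are hypothetical and were chosen only to show how a single additional event moves the confidence in a small stratum compared with a larger one.

```python
def confidence(n_factor_and_outcome: int, n_factor: int) -> float:
    """Proportion with the outcome among those with the risk factor."""
    return n_factor_and_outcome / n_factor


def confidence_shift(n_factor_and_outcome: int, n_factor: int,
                     extra_events: int = 1) -> float:
    """Change in confidence if `extra_events` more people had the outcome."""
    before = confidence(n_factor_and_outcome, n_factor)
    after = confidence(n_factor_and_outcome + extra_events, n_factor)
    return after - before


# Hypothetical small stratum: 10 people with the risk factor, 2 events.
small_shift = confidence_shift(2, 10)     # 2/10 -> 3/10

# Hypothetical larger stratum with the same starting confidence:
# 200 people with the risk factor, 40 events.
large_shift = confidence_shift(40, 200)   # 40/200 -> 41/200
```

Both strata start at a confidence of 0.20, but one extra event raises it by 0.10 in the small stratum and by only 0.005 in the larger one, so estimates from sparse strata should be read with corresponding caution.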

Conclusion
This study used association rules to assess the relationships between risk factors and kidney events in two diabetic cohorts. Most results were similar to previous reports, and stratified analysis allowed assessments by risk categories and retinopathy. In the future, validation studies are needed in cohorts with similar backgrounds. Analysis using association rules is a transparent method, and evaluation of its efficacy for sharing CKD risk information with patients may be warranted.

Data availability
The datasets generated during and/or analyzed during the current study are not publicly available due to their containing information that could compromise the privacy of research participants but are available from the corresponding author on reasonable request.